For any baseball player, the highest individual achievement is induction into the Hall of Fame through the BBWAA. This is an exclusive club, with 323 inductees, including 226 former Major League Baseball players and 35 players and executives from the Negro League. For our last mini-project1, we used the Lahman dataset to explore statistical differences between those inducted in the Hall of Fame and those who were statistically on the cusp. We wanted to examine how the standards for HOF players have changed over time. Baseball as a game has changed considerably in the time it has been played, with greater emphasis being placed on hitting home runs, while strikeouts are at an all-time high. Additionally, the game has been tainted somewhat by the Steroid Era. We tried to make some suggestions about certain players that should be inducted into the Hall of Fame based on the statistics of those who have been inducted.
We spent considerable time working with SQL to narrow down our queries and create objects showing meaningful statistics comparing pitchers and hitters currently in the Hall of Fame to those not in the Hall of Fame.
Delving deeper into investigating which players have been closest to being inducted, we see that many of those with the strongest statistical resumes are those with clear connections to performance enhancing drugs (PEDs): the three highest HR totals for those not inducted in the Hall of Fame according to our search are Barry Bonds, Mark McGwire, and Manny Ramirez, players linked to PEDs in the past. This kind of domain knowledge beyond the numbers is needed to understand why certain players have not been elected to the Hall of Fame. As we have talked about before in this class, an important job of data scientists is to be culturally aware and think deeply about trends or patterns observed in the data.
library(mdsr)
library(RMySQL)
library(tidyverse)
library(plotly)
library(ggrepel)
library(gridExtra)
library(ggthemes)
library(plotly)
db <- dbConnect_scidb(dbname = "lahman")
pitchers <- db %>%
dbGetQuery(
"SELECT hof.playerID, hof.yearID, nameFirst, nameLast,
sum(G) as G, sum(pi.GS) as GS, sum(IPouts/3) as IP,
sum(CG) as CG, sum(SHO) as SHO, sum(SV) as SV, sum(W) as W,
sum(L) as L, sum(H) as H, sum(BB) as BB, sum(ER) as ER,
sum(SO) as SO, ((sum(ER) / (sum(IPouts)/3)) * 9) as ERA,
((sum(H) + sum(BB)) / (sum(IPouts) / 3)) as WHIP , hof.inducted
FROM HallOfFame hof
JOIN Master ma ON ma.playerID = hof.playerID
JOIN Pitching pi ON pi.playerID = hof.playerID
WHERE hof.inducted = 'Y'
AND hof.votedBy = 'BBWAA'
GROUP BY hof.playerID
HAVING G > 100;"
)
# Specifying BBWAA (Highest authority for bsaeball)
# Using index hof.inducted = 'Y' and G > 100 to make sure they didn't just pitch for one game
# Adding up all the pitching stats per player
pitchers
hitters <- db %>%
dbGetQuery("SELECT hof.playerID, hof.yearID, nameFirst,
nameLast, sum(AB) as AB, sum(R) as R, sum(H) as H,
sum(2B) as 2B, sum(3B) as 3B, sum(HR) as HR,
sum(RBI) as RBI, sum(SB) as SB,
(sum(H) / sum(AB)) as AVG , hof.inducted
FROM HallOfFame hof
JOIN Master ma ON ma.playerID = hof.playerID
JOIN Batting ba ON ba.playerID = hof.playerID
WHERE hof.inducted = 'Y'
AND hof.votedBy = 'BBWAA'
GROUP BY hof.playerID
HAVING AB > 2000;")
# Using indices AB > 2000 to sort for players that regularly bat, not some pitchers who've only batted a few times
# Adding up all the batting stats
hitters
batters_not_inducted <- db %>%
dbGetQuery("SELECT hof.playerID, nameLast, nameFirst,
count(DISTINCT hof.yearID) as num_ballots,
min(hof.yearid) AS first_year, ballots, needed,
votes, inducted, max(votes / ballots) as PCT,
sum(AB) / count(DISTINCT hof.yearID) as AB,
sum(R) / count(DISTINCT hof.yearID) as R,
sum(H) / count(DISTINCT hof.yearID) as H,
sum(2B) / count(DISTINCT hof.yearID) as 2B,
sum(3B) / count(DISTINCT hof.yearID) as 3B,
sum(HR) / count(DISTINCT hof.yearID) as HR,
sum(RBI) / count(DISTINCT hof.yearID) as RBI,
sum(SB) / count(DISTINCT hof.yearID) as SB,
(sum(H) / sum(AB)) as AVG,
sum(inducted = 'Y') as indct
FROM HallOfFame as hof
JOIN Batting ba ON ba.playerID = hof.playerID
JOIN Master ma ON ma.playerID = hof.playerID
WHERE hof.yearid > 1979
GROUP BY hof.playerID
HAVING indct = 0
AND AB > 4000
AND PCT > 0.2
ORDER BY PCT desc;")
batters_not_inducted
pitchers_not_inducted <- db %>%
dbGetQuery("SELECT hof.playerID, hof.yearID, ma.nameFirst, ma.nameLast, count(DISTINCT hof.yearID) as num_ballots, hof.ballots , hof.needed , hof.votes , max(votes/ballots) as PCT , hof.inducted,
sum(inducted = 'Y') as indct ,sum(G)/(count(DISTINCT hof.yearID)) as G, sum(GS)/(count(DISTINCT hof.yearID)) as GS, sum(IPouts/3)/(count(DISTINCT hof.yearID)) as IP,
sum(CG) / (count(DISTINCT hof.yearID)) as CG, sum(SHO) / (count(DISTINCT hof.yearID)) as SHO, sum(SV) / (count(DISTINCT hof.yearID)) as SV, sum(W) / (count(DISTINCT hof.yearID)) as W,
sum(L) / (count(DISTINCT hof.yearID)) as L, sum(H) / (count(DISTINCT hof.yearID)) as H, sum(BB) / (count(DISTINCT hof.yearID)) as BB, sum(ER) / (count(DISTINCT hof.yearID)) as ER,
sum(SO)/(count(DISTINCT hof.yearID)) as SO, ((sum(ER) / (sum(IPouts)/3)) * 9) as ERA,
((sum(H) + sum(BB)) / (sum(IPouts) / 3)) as WHIP
FROM lahman.HallOfFame AS hof
JOIN lahman.Master AS ma on ma.playerID = hof.playerID
JOIN lahman.Pitching AS pi on pi.playerID = hof.playerID
GROUP BY ma.playerID
HAVING indct = 0
AND PCT > 0.2
AND G > 100
ORDER BY PCT desc;")
pitchers_not_inducted
For Hall of Fame hitters, we looked at how their statistics may have changed over time to understand the BBWAA voters have evaluated players throughout the years. For the most part, HRs have increased while AVG and SBs have decreased. There has been a push in recent years for hitters to focus on launch angle and swing for the fences in a strategy that has not been seen to this extent in baseball before. This means more strikeouts and home runs for many hitters. Players like Mark Reynolds, Chris Davis, and Logan Morrison are poster childs for this revolution. That being said, the trends in the Hall of Fame statistics probably do not even fully represent this recent change because of the lag between the termination of a player’s career and five years later when they can first be inducted. So this strategic change may be even more readily apparent in a decade or two. Additionally, it should be noted that some players in the inducted batter graphs were inducted for their pitching ability, but may have played before the introduction of a DH.
#Added slugging percentage here to hitters by our own calculation
# because it wasn't included in the lahman statistics.
hitters <- hitters %>%
mutate(SLG = ((hitters$'2B' * 2) + (hitters$'3B' * 3) + (hitters$'HR' * 4) + (hitters$'H' - hitters$'2B' - hitters$'3B' - hitters$'HR')) / hitters$'AB')
hiit_1 <- plot_ly(hitters, x = ~yearID, color = I("black")) %>% # Showing the yearID when you hover over
add_markers(y = ~HR, text = ~paste('Player: ', nameFirst, "", nameLast, yearID), showlegend = FALSE) %>%
add_lines(y = ~fitted(loess(HR ~ yearID)), # Using the loess method for our trend line
line = list(color = '#07A4B5'), # Making it a distinct color to separate from the black points
name = "Trendline", showlegend = FALSE) %>%
layout(xaxis = list(title = 'Year'), # Year is on the x-axis
yaxis = list(title = 'HR')) # Home run on the y-axis
hiit_2 <- plot_ly(hitters, x = ~yearID, color = I("black")) %>%
add_markers(y = ~AVG, text = ~paste('Player: ', nameFirst, "", nameLast, yearID), showlegend = FALSE) %>%
add_lines(y = ~fitted(loess(AVG ~ yearID)),
line = list(color = '#07A4B5'),
name = "Trendline", showlegend = FALSE) %>%
layout(xaxis = list(title = 'Year'),
yaxis = list(title = 'AVG'))
hiit_3 <- plot_ly(hitters, x = ~yearID, color = I("black")) %>%
add_markers(y = ~SB, text = ~paste('Player: ', nameFirst, "", nameLast, yearID), showlegend = FALSE) %>%
add_lines(y = ~fitted(loess(SB ~ yearID)),
line = list(color = '#07A4B5'),
name = "Trendline", showlegend = FALSE) %>%
layout(xaxis = list(title = 'Year'),
yaxis = list(title = 'SB'))
hiit_4 <- plot_ly(hitters, x = ~yearID, color = I("black")) %>%
add_markers(y = ~SLG, text = ~paste('Player: ', nameFirst, "", nameLast, yearID), showlegend = FALSE) %>%
add_lines(y = ~fitted(loess(SLG ~ yearID)),
line = list(color = '#07A4B5'),
name = "Loess Smoother", showlegend = FALSE) %>%
layout(xaxis = list(title = 'Year'),
yaxis = list(title = 'SLG'))
subplot(hiit_1, hiit_2, hiit_3 , hiit_4 , nrows = 2, margin = 0.06, heights = c(.55, .45), titleY = TRUE, titleX = TRUE) # Arranging all of the graphs together
Moving on to the pitchers, the most interesting findings are that ERA and Strikeouts have generally increased while wins and complete games have decreased. As noted before, hitters in recent years try to hit more home runs at the expense of striking out more. With ERA, the bump in earned runs given up by pitchers can be explained by the fact that hitters have more strength and conditioning than ever before, and because the use of advanced scouting provides an in-depth look revealing the smallest tell in a pitcher’s form. Complete games and wins have similarly decreased steadily among those inducted to the Hall of Fame. Overall, teams nowadays have specialized relief pitchers who expend more effort every pitch/inning, allowing high quality relievers to take over the game in late innings. Additionally, the wear and tear pitching fast has on arms, and the increased number of reconstructive surgeries like Tommy John surgery also play into the fewer number of complete games.
# Looking at wins over the years
piit_1 <- plot_ly(pitchers, x = ~yearID, color = I("black")) %>%
add_markers(y = ~W, text = ~paste('Player: ', nameFirst, "", nameLast, yearID), showlegend = FALSE) %>%
add_lines(y = ~fitted(loess(W ~ yearID)),
line = list(color = '#07A4B5'),
name = "Trendline", showlegend = FALSE) %>%
layout(xaxis = list(title = 'Year'),
yaxis = list(title = 'W'))
piit_2 <- plot_ly(pitchers, x = ~yearID, color = I("black")) %>%
add_markers(y = ~SO, text = ~paste('Player: ', nameFirst, "", nameLast, yearID), showlegend = FALSE) %>%
add_lines(y = ~fitted(loess(SO ~ yearID)),
line = list(color = '#07A4B5'),
name = "Trendline", showlegend = FALSE) %>%
layout(xaxis = list(title = 'Year'),
yaxis = list(title = 'SO'))
piit_3 <- plot_ly(pitchers, x = ~yearID, color = I("black")) %>%
add_markers(y = ~ERA, text = ~paste('Player: ', nameFirst, "", nameLast, yearID), showlegend = FALSE) %>%
add_lines(y = ~fitted(loess(ERA ~ yearID)),
line = list(color = '#07A4B5'),
name = "Trendline", showlegend = FALSE) %>%
layout(xaxis = list(title = 'Year'),
yaxis = list(title = 'ERA'))
piit_4 <- plot_ly(pitchers, x = ~yearID, color = I("black")) %>%
add_markers(y = ~CG, text = ~paste('Player: ', nameFirst, "", nameLast, yearID), showlegend = FALSE) %>%
add_lines(y = ~fitted(loess(CG ~ yearID)),
line = list(color = '#07A4B5'),
name = "Loess Smoother", showlegend = FALSE) %>%
layout(xaxis = list(title = 'Year'),
yaxis = list(title = 'CG'))
subplot(piit_1, piit_2, piit_3 , piit_4 , nrows = 2, margin = 0.06, heights = c(.55, .45), titleY = TRUE, titleX = TRUE)
# Showing all of the pitching stats over time in one graphic
# Defining a font, this is from the plotly package. We wanted an interactive graphic
f <- list(
family = "Times New Roman" ,
size = 18,
color = "#7f7f7f"
)
# Choosing the x-axis
era_x <- list(
title = "ERA",
titlefont = f
)
# Choosing the y-axis
wins_y <- list(
title = "W",
titlefont = f
)
In terms of pitchers not in the Hall of Fame, Roger Clemens sticks out. His overall number of wins and ERA most resemble those of Greg Maddux, a historically great pitcher and unquestionable Hall of Famer. His career strikeouts are also surpassed only by Nolan Ryan and Randy Johnson. However, Clemens’ career is tainted with PED allegations from his personal trainer.
Roger Clemens
For batters, there are a few more interesting cases. Obviously, Barry Bonds — the major league leader in home runs with almost a .300 batting average and an incredible and fearsome hitter — is the first name that jumps off the graphic. The main reason he is not in the Hall of Fame is because of steroid use, but statistically speaking he deserves to be there. It is our belief that players such as Bonds and Clemens and other players known to have used PEDs should be in the Hall of Fame because they are “inexorably woven into the fabric of the game.” 2 No players in recent memory compare “in ability or accomplishment — not even close — to either Bonds or Clemens.” “[t]he Hall of Fame’s mission is to preserve the sport’s history, honor excellence within the game” and a “Hall of Fame without either of those two is a Hall of Fame that doesn’t tell the full story of the game.”
Barry Bonds
Another interesting case is Edgar Martinez. His average of .312 and 309 career home runs put him statistically similar to Hall of Famers like George Brett, Ryne Sandberg, and Ivan Rodriguez. The main knock against Edgar is not anything to do with steroids, but because he spent the majority of his career as a designated hitter, meaning he rarely played the field and did not contribute defensively. Again from a statistical perspective, it appears he belongs in the Hall of Fame. Edgar was one of the most feared hitters of his era, playing a memorable role in both the Mariners’ 1995 postseason run, as well as their historic 116-win season in 2001. It seems unfortunate that his reputation of not frequently playing the field will prevent a player that is one of the better hitters in this generation from reaching the Hall of Fame.
Edgar Martinez
# Plotting the non-inducted pitchers
non_hof_pitch <- plot_ly(pitchers_not_inducted, x = ~ERA, y = ~W, type = 'scatter', mode = 'markers',
text = ~paste('Player: ', nameFirst , "" , nameLast, yearID , inducted)) %>%
layout(
xaxis = list(range = c(2, 4)),
yaxis = list(range = c(0, 550))) %>%
layout(xaxis = era_x , yaxis = wins_y)
# Plotting for hof_pitchers
hof_pitch <- plot_ly(pitchers, x = ~ERA, y = ~W, type = 'scatter', mode = 'markers',
text = ~paste('Player: ', nameFirst , "" , nameLast, yearID , inducted)) %>%
layout(
xaxis = list(range = c(2, 4)),
yaxis = list(range = c(0, 550))) %>%
layout(xaxis = era_x , yaxis = wins_y)
p_arranged <- subplot(hof_pitch , non_hof_pitch , titleX = TRUE , titleY = TRUE) %>%
layout(showlegend = FALSE)
p_arranged
both_pitchers <- db %>%
dbGetQuery("SELECT sum(inducted = 'Y') as indct, hof.playerID, hof.yearID, ma.nameFirst, ma.nameLast, count(DISTINCT hof.yearID) as num_ballots, hof.ballots , hof.needed , hof.votes , max(votes/ballots) as PCT , hof.inducted,
sum(G)/(count(DISTINCT hof.yearID)) as G, sum(GS)/(count(DISTINCT hof.yearID)) as GS, sum(IPouts/3)/(count(DISTINCT hof.yearID)) as IP,
sum(CG) / (count(DISTINCT hof.yearID)) as CG, sum(SHO) / (count(DISTINCT hof.yearID)) as SHO, sum(SV) / (count(DISTINCT hof.yearID)) as SV, sum(W) / (count(DISTINCT hof.yearID)) as W,
sum(L) / (count(DISTINCT hof.yearID)) as L, sum(H) / (count(DISTINCT hof.yearID)) as H, sum(BB) / (count(DISTINCT hof.yearID)) as BB, sum(ER) / (count(DISTINCT hof.yearID)) as ER,
sum(SO)/(count(DISTINCT hof.yearID)) as SO, ((sum(ER) / (sum(IPouts)/3)) * 9) as ERA,
((sum(H) + sum(BB)) / (sum(IPouts) / 3)) as WHIP
FROM lahman.HallOfFame AS hof
JOIN lahman.Master AS ma on ma.playerID = hof.playerID
JOIN lahman.Pitching AS pi on pi.playerID = hof.playerID
GROUP BY ma.playerID
HAVING PCT > 0.2
AND G > 100
ORDER BY PCT desc;")
plot_ly(both_pitchers, x = ~ERA, color = I("black")) %>%
add_markers(y = ~W, text = ~paste('Player: ', nameFirst, "", nameLast), color = ~inducted, showlegend = TRUE) %>%
add_lines(y = ~fitted(loess(W ~ ERA)),
line = list(color = '#000000'),
name = "Loess Smoother", showlegend = FALSE) %>%
layout(xaxis = list(title = 'W'),
yaxis = list(title = 'ERA'))
avg_x <- list(
title = "AVG",
titlefont = f
) # Defining the x-axis title
hr_y <- list(
title = "HR",
titlefont = f
) # Defining the y-axis title
# HOF Batters
h_1 <- plot_ly(hitters, x = ~AVG, y = ~HR, type = 'scatter', mode = 'markers',
text = ~paste('Player: ', nameFirst , "" , nameLast , inducted)) %>%
layout(xaxis = avg_x , yaxis = hr_y) %>%
layout( xaxis = list(range = c(.2, .40)),
yaxis = list(range = c(0, 800))) %>%
layout(xaxis = avg_x , yaxis = hr_y)
# Non-HOF Batters
h_2 <- plot_ly(batters_not_inducted, x = ~AVG, y = ~HR, type = 'scatter', mode = 'markers',
text = ~paste('Player: ', nameFirst , "" , nameLast , inducted)) %>%
layout(
xaxis = list(range = c(.2, .40)),
yaxis = list(range = c(0, 800))) %>%
layout(xaxis = avg_x , yaxis = hr_y)
h_arranged<- subplot(h_1 , h_2 , titleX = TRUE , titleY = TRUE) %>%
layout(showlegend = FALSE)
h_arranged
both_hitters <- db %>%
dbGetQuery("SELECT hof.playerID, nameLast, nameFirst,
count(DISTINCT hof.yearID) as num_ballots,
min(hof.yearid) AS first_year, ballots, needed,
votes, inducted, max(votes / ballots) as PCT,
sum(AB) / count(DISTINCT hof.yearID) as AB,
sum(R) / count(DISTINCT hof.yearID) as R,
sum(H) / count(DISTINCT hof.yearID) as H,
sum(2B) / count(DISTINCT hof.yearID) as 2B,
sum(3B) / count(DISTINCT hof.yearID) as 3B,
sum(HR) / count(DISTINCT hof.yearID) as HR,
sum(RBI) / count(DISTINCT hof.yearID) as RBI,
sum(SB) / count(DISTINCT hof.yearID) as SB,
(sum(H) / sum(AB)) as AVG
FROM HallOfFame as hof
JOIN Batting ba ON ba.playerID = hof.playerID
JOIN Master ma ON ma.playerID = hof.playerID
WHERE hof.yearid > 1979
GROUP BY hof.playerID
HAVING AB > 4000
AND PCT > 0.2
ORDER BY PCT desc;")
plot_ly(both_hitters, x = ~AVG, color = I("black")) %>%
add_markers(y = ~HR, text = ~paste('Player: ', nameFirst, "", nameLast), color = ~inducted, showlegend = TRUE) %>%
add_lines(y = ~fitted(loess(HR ~ AVG)),
line = list(color = '#000000'),
name = "Loess Smoother", showlegend = FALSE) %>%
layout(xaxis = list(title = 'AVG'),
yaxis = list(title = 'HR'))
inducted_all_star_appearances <- db %>%
dbGetQuery("SELECT tab1.playerID, tab1.nameFirst, tab1.nameLast, tab1.total_games, tab1.firstyear, tab1.lastyear, tab1.diff,
HallOfFame.inducted
FROM HallOfFame,
((SELECT af.playerID, nameFirst, nameLast,
sum(af.GP) as total_games, GP,
min(af.yearID) as firstyear, max(af.yearID) as lastyear, max(af.yearID)-min(af.yearID) as diff
FROM lahman.AllstarFull as af
JOIN Master as ms ON ms.playerID = af.playerID
GROUP BY af.playerID
ORDER BY GP desc, total_games desc) as tab1)
WHERE HallOfFame.playerID=tab1.playerID and HallOfFame.inducted = 'Y'
GROUP BY playerID order by total_games desc;")
#Had to use a subquery here because of problems with rows being multiplied
#Adding up the total number of all-star games played in per player
not_inducted_all_star_appearances <- db %>%
dbGetQuery("SELECT tab1.playerID, tab1.nameFirst, tab1.nameLast, tab1.total_games, tab1.firstyear, tab1.lastyear, tab1.diff,
HallOfFame.inducted
FROM HallOfFame,
((SELECT af.playerID, nameFirst, nameLast,
sum(af.GP) as total_games, GP,
min(af.yearID) as firstyear, max(af.yearID) as lastyear, max(af.yearID)-min(af.yearID) as diff
FROM lahman.AllstarFull as af
JOIN Master as ms ON ms.playerID = af.playerID
GROUP BY af.playerID
ORDER BY GP desc, total_games desc) as tab1)
WHERE HallOfFame.playerID=tab1.playerID and HallOfFame.inducted!='Y'
GROUP BY playerID order by total_games desc;")
All-star appearances is another important statistic to consider because it illustrates how a player compared to his competition, with more dominating players making more all star appearances for many years. This graphic illustrates that there are many non-inducted players that have been to more all-star games than inducted counterparts: among inducted players, the average number of all-star appearances is 7.22, and there are an astonishing 44 non-inducted players above this bar.
inducted_all_star_appearances <- inducted_all_star_appearances %>%
mutate(full_name = paste(nameFirst, nameLast, sep = " "))
#adding a full name variable
inducted_all_star_appearances$full_name <- factor(inducted_all_star_appearances$full_name, levels = unique(inducted_all_star_appearances$full_name)[order(inducted_all_star_appearances$total_games, decreasing = FALSE)])
inducted_all_star_appearances
#ordering the data by the total all-star games appearances descending so on the bar chart they will show up in order
yes<- plot_ly(data = inducted_all_star_appearances,
x = ~full_name, y = ~total_games, type = "bar", name = 'Inducted') %>%
layout(
yaxis = list(range = c(0,25), title = "All-Star Game Appearances"),
xaxis = list(
title = "Player",
showticklabels = FALSE))
not_inducted_all_star_appearances <- not_inducted_all_star_appearances %>%
mutate(full_name = paste(nameFirst, nameLast, sep = " "))
not_inducted_all_star_appearances$full_name <- factor(not_inducted_all_star_appearances$full_name, levels = unique(not_inducted_all_star_appearances$full_name)[order(not_inducted_all_star_appearances$total_games, decreasing = FALSE)])
no <- plot_ly(data = not_inducted_all_star_appearances,
x = ~full_name, y = ~total_games, type = "bar", name = 'Not Inducted') %>%
layout(
yaxis = list(range = c(0,25), title = "All-Star Game Appearances"),
xaxis = list(
title = "Player",
showticklabels = FALSE))
#had to get rid of the x-axis labels because not all the names fit
not_inducted_all_star_appearances
subplot(yes, no, nrows = 1, margin = 0.06, heights = 1, shareY = TRUE, titleX = TRUE)
avg_inducted_all_star_appearances <- inducted_all_star_appearances%>%
summarize(mean(total_games))
avg_inducted_all_star_appearances
# calculating the average number of all-star appearances for inducted players
above_seven_not_inducted_all_star_appearances <- db %>%
dbGetQuery("SELECT tab1.playerID, tab1.nameFirst, tab1.nameLast, tab1.total_games, tab1.firstyear, tab1.lastyear, tab1.diff,
HallOfFame.inducted
FROM HallOfFame,
((SELECT af.playerID, nameFirst, nameLast,
sum(af.GP) as total_games, GP,
min(af.yearID) as firstyear, max(af.yearID) as lastyear, max(af.yearID)-min(af.yearID) as diff
FROM lahman.AllstarFull as af
JOIN Master as ms ON ms.playerID = af.playerID
GROUP BY af.playerID
ORDER BY GP desc, total_games desc) as tab1)
WHERE HallOfFame.playerID=tab1.playerID and HallOfFame.inducted!='Y'
AND total_games > 7
GROUP BY playerID order by total_games desc;")
dim(above_seven_not_inducted_all_star_appearances)
## [1] 44 8
# finding the number of players (rows) that have been to more all-star games than the average inducted player